Evaluating classifier performance with highly imbalanced Big Data

Authors

Abstract

Using the wrong metrics to gauge classification of highly imbalanced Big Data may hide important information in experimental results. However, we find that analysis of metrics for performance evaluation, and of what they can hide or reveal, is rarely covered in related works. Therefore, we address that gap by analyzing multiple popular performance metrics on three classification tasks. To the best of our knowledge, we are the first to utilize new Medicare insurance claims datasets which became publicly available in 2021. These datasets are all highly imbalanced. Furthermore, they are comprised of completely different data. We evaluate five ensemble learners on the Machine Learning task of fraud detection. Random Undersampling (RUS) is applied to induce different class ratios. The classifiers are evaluated with both the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC) metrics. We show that AUPRC provides better insight into classifier performance. Our findings show that the AUC metric hides the impact of RUS, whereas results in terms of AUPRC show that RUS has a detrimental effect. We find that, for highly imbalanced Big Data, the AUC metric fails to capture the information about precision scores and false positive counts that AUPRC reveals. Our contribution is to show that AUPRC is a more effective metric for evaluating performance when working with highly imbalanced Big Data.
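The contrast the abstract describes can be reproduced in miniature. The sketch below is illustrative only, not the paper's code: it uses a synthetic dataset, a random forest learner, and a single hypothetical 1:1 RUS ratio (all assumptions) to show how AUC and AUPRC are computed side by side, with and without undersampling the training data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary task with roughly 1% positives to mimic severe imbalance.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)


def random_undersample(X, y, ratio, rng):
    """Randomly discard majority-class rows until the minority:majority
    ratio reaches `ratio` (1.0 means a balanced 1:1 training set)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=int(len(pos) / ratio), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]


rng = np.random.default_rng(0)
results = {}
for label, (Xs, ys) in {
    "full": (X_tr, y_tr),
    "RUS 1:1": random_undersample(X_tr, y_tr, 1.0, rng),
}.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xs, ys)
    scores = clf.predict_proba(X_te)[:, 1]
    # AUC summarizes the ROC curve; AUPRC (average precision) summarizes
    # the precision-recall curve and is sensitive to false positives.
    results[label] = (roc_auc_score(y_te, scores),
                      average_precision_score(y_te, scores))
    print(label, results[label])
```

On data this imbalanced, AUC typically stays high for both training regimes while AUPRC moves more visibly, which is the kind of divergence the paper's evaluation is built around.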


Similar articles

Mining Imbalanced Data with Learning Classifier Systems

This chapter investigates the capabilities of XCS for mining imbalanced datasets. Initial experiments show that, for moderate and high class imbalances, XCS tends to evolve a large proportion of overgeneral classifiers. Theoretical analyses are developed, deriving an imbalance bound up to which XCS should be able to differentiate between accurate and overgeneral classifiers. Some relevant param...


Evaluating Misclassifications in Imbalanced Data

Evaluating classifier performance with ROC curves is popular in the machine learning community. To date, the only method to assess confidence of ROC curves is to construct ROC bands. In the case of severe class imbalance with few instances of the minority class, ROC bands become unreliable. We propose a generic framework for classifier evaluation to identify a segment of an ROC curve in which m...


Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets

During the process of knowledge discovery in data, imbalanced learning data often emerges and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class imbalanced data on the classification results of artificial intelligence methods, i.e. neural networks and support vector machine, and on the classification results of classical classification...


Evaluating Difficulty of Multi-class Imbalanced Data

Multi-class imbalanced classification is more difficult than its binary counterpart. Besides typical data difficulty factors, one should also consider the complexity of relations among classes. This paper introduces a new method for examining the characteristics of multi-class data. It is based on analyzing the neighbourhood of the minority class examples and on additional information about sim...


Complexity Curve: a Graphical Measure of Data Complexity and Classifier Performance Supplementary document S2: Evaluating Classifier Performance with Generalisation Curves

We discussed the role of data complexity measures in the evaluation of classification algorithm performance. Knowing the characteristics of benchmark data sets, it is possible to check which algorithms perform well in the context of scarce data. To fully utilise this information, we present a graphical performance measure called the generalisation curve. It is based on the learning curve concept and allows...



Journal

Journal title: Journal of Big Data

سال: 2023

ISSN: 2196-1115

DOI: https://doi.org/10.1186/s40537-023-00724-5